Abstract

Children face an enormously difficult task in learning their native language. It is widely believed that they do not receive, or make little use of, negative evidence (Marcus, 1993), and yet it has been proven that many classes of languages less powerful than natural languages cannot be learned in the absence of negative evidence (Gold, 1967). In this paper we present an approach to learning good approximations to members of one such class of languages, the regular languages, based on positive evidence alone.


1.  Introduction 

The ability to communicate through spoken language is widely regarded as the hallmark of human intelligence. Children acquire their native tongue with remarkable ease, mastering the vast majority of that language before they enter school. However, the facility with which children acquire language belies the complexity of the task. For example, children clearly receive positive evidence (examples of sentences in the language), but it is widely believed that children do not receive negative evidence (examples of sentences that are not in the language and that are somehow marked as such) (Marcus, 1993). The difficulty with respect to learnability arises with a now famous theorem due to Gold (1967). He proved that several classes of languages, including regular, context free and context sensitive, can be identified in the limit when the learner has access to both positive and negative evidence. However, those same classes of languages cannot be learned from positive evidence alone. How do children overcome Gold's theoretical hurdle?

Difficulties such as the one above led Chomsky (1975) to suggest that language is innate, that it is not learned per se but that facility with language grows and matures much in the same way that one’s organs are genetically predetermined to grow and mature. In this paper we explore the possibility that language is not innate by developing algorithms for learning good approximations of regular languages from positive evidence alone. Although no natural language is strictly regular, large subsets of natural languages are regular, and this class of languages is the simplest one covered by Gold’s theorem. As such, it seemed like a good place to start our investigations into learnability.

Our approach to learning regular languages begins with a class of languages, called Szilard (Mäkinen, 1997), that can be learned from positive evidence alone. Given examples of sentences generated by an arbitrary regular language, we assume that the language is Szilard, yielding a representation of the language that simply “memorizes” the input and does no generalization. We then apply heuristic techniques to create increasingly compact representations that maintain the Szilard property and that become better approximations to the target language.

The remainder of the paper is organized as follows. Section 2 briefly reviews deterministic finite automata, their relationship to regular grammars, and how a Szilard language can be learned from positive evidence alone. Section 3 explains our algorithm for learning a good approximation of an arbitrary regular language from positive evidence. Section 4 describes the implementation of the algorithm and Section 5 presents experiments and results. Finally, Section 6 concludes and points to future research directions.

2.  Finite automata, trivial DFAs and Szilard regular grammars

Deterministic finite automata (DFA) are well-known, simple computing devices that are described by the tuple:

(Set_of_States, Alphabet, Transition_Function, Start_State, Set_of_Final_States)

Finite automata are equivalent to regular grammars and thus recognize precisely the class of regular languages. Their functioning is described by a transition function:

$\delta(\mathit{current\_state}, \mathit{current\_input\_symbol}) = \mathit{next\_state}$

which is equivalent to a set of productions:

$\{\mathit{current\_state} \to \mathit{current\_input\_symbol}\ \mathit{next\_state}\}$.

The input symbols are elements of the alphabet and are usually called terminals. A DFA can be visualized by its associated graph, as in figure 1: the nodes are the states and the arcs represent the transitions. The labels on the arcs are the input symbols. A sentence is said to be accepted by the automaton if there is a path from the start state to a final state, labeled with the words of the sentence.

Figure 1: A small deterministic finite automaton.

DFA induction is the problem of finding the automaton that describes a given language from a set of examples: sentences whose membership in the language is known. Gold's famous theorem on language learnability states that any class of regular languages (and other languages) can be learned from positive and negative examples, but not from positive examples only, if the class contains all finite languages and at least one infinite language.
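The acceptance condition described in this section can be sketched in a few lines of Python. This is only an illustration: the automaton below is a hypothetical one for the language a (c e | d e)* b, and its state names ("S0", "Q1", "Q2", "FS") are assumptions, not labels taken from figure 1.

```python
def accepts(transitions, start, finals, sentence):
    """Return True if the DFA accepts the sequence of input symbols."""
    state = start
    for symbol in sentence:
        key = (state, symbol)
        if key not in transitions:   # no arc for this symbol: reject
            return False
        state = transitions[key]     # follow the transition function
    return state in finals

# Hypothetical DFA for the language a (c e | d e)* b.
transitions = {
    ("S0", "a"): "Q1",
    ("Q1", "c"): "Q2",
    ("Q1", "d"): "Q2",
    ("Q2", "e"): "Q1",
    ("Q1", "b"): "FS",
}

print(accepts(transitions, "S0", {"FS"}, "aceb"))  # True
print(accepts(transitions, "S0", {"FS"}, "ace"))   # False
```

Storing the transition function as a dictionary keyed by (state, symbol) pairs makes determinism explicit: each pair maps to at most one next state.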

Any finite language can be represented by a trivial DFA that has a distinct state for each word occurrence in each sentence, meaning that the learner memorizes all the sentences.

A Szilard regular grammar has the property that its productions have the form $A_i \to a_{ij} A_j$, meaning that each terminal appears on one arc only. It follows that the number of terminals equals the number of productions. The number of states (without the start state) is at most equal to the number of productions. The inference algorithm that finds a Szilard grammar from a set of positive examples is:

•  assume that the number of states is equal to the number of productions; associate each state with a terminal a and name it accordingly, A: $A \to aX$, where “X” stands for the unknown next state; if a terminal is the first one in a sentence, then its associated state is the start state; if it is the last one in the sentence, it is followed by the final state.

•  starting from the start state, follow the derivations for the given examples and merge the states which follow the same terminal; continue the process until a Szilard grammar is obtained.
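The two steps above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the union-find bookkeeping, the function names and the labels "START" and "FINAL" are all assumptions. Each terminal t gets a source state ("src", t) and a destination state ("dst", t); the examples then force states to be identified.

```python
def infer_szilard(sentences):
    """Infer a Szilard automaton: every terminal labels exactly one arc."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(x, y):
        rx, ry = find(x), find(y)
        if rx != ry:
            parent[ry] = rx

    for s in sentences:                     # sentences are assumed non-empty
        union("START", ("src", s[0]))       # first terminal leaves the start state
        for a, b in zip(s, s[1:]):
            union(("dst", a), ("src", b))   # the state after a is the state before b
        union(("dst", s[-1]), "FINAL")      # last terminal enters the final state

    terminals = {t for s in sentences for t in s}
    arcs = {t: (find(("src", t)), find(("dst", t))) for t in terminals}
    return find("START"), find("FINAL"), arcs

def szilard_accepts(start, final, arcs, sentence):
    state = start
    for t in sentence:
        if t not in arcs or arcs[t][0] != state:
            return False
        state = arcs[t][1]
    return state == final

start, final, arcs = infer_szilard(["aceb", "adeb", "ab"])
print(szilard_accepts(start, final, arcs, "acedeb"))  # True
```

With the three examples {aceb, adeb, ab}, the states that follow "a" collapse into a single state, and the sketch accepts unseen sentences such as "acedeb", so the inferred language is infinite.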

The algorithm finds the target DFA, provided that it sees all the possible consecutive transitions $(a_i, a_j)$. For example, the DFA in figure 1 can be learned from three examples {aceb, adeb, ab}, even though its language is infinite. The first step yields the state sequences (S0 C E B FS), (S0 D E B FS) and (S0 B FS). Because the states B, C and D all follow the terminal “a”, they are merged into state (BCD) and the desired DFA is obtained. An even simpler grammar which can be inferred from positive examples only has the property that each terminal uniquely identifies the next state. The inference algorithm is immediate:

•  assign one variable A to each terminal a to obtain productions of the type $X \to aA$, where “X” is the unknown current state;

•  from the start state, for a sequence (a b ...), recover the productions: $S0 \to aA$, $A \to bB$, ...

Through an abuse of notation, we will denote this second grammar Szilard*, and it and the first one collectively Szilard.
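A minimal sketch of the Szilard* inference follows, assuming that the state reached after reading a terminal is simply named after that terminal, and that a state is accepting when some example ends in it. The text does not specify how final states are marked, so that convention and the function names are assumptions.

```python
def infer_szilard_star(sentences):
    """Each terminal uniquely identifies the next state (named after it)."""
    transitions = {}
    accepting = set()
    for sentence in sentences:              # sentences are assumed non-empty
        state = "S0"
        for terminal in sentence:
            transitions[(state, terminal)] = terminal   # X -> a A
            state = terminal
        accepting.add(state)                # assumption: last state is accepting
    return transitions, accepting

def star_accepts(transitions, accepting, sentence):
    state = "S0"
    for terminal in sentence:
        if (state, terminal) not in transitions:
            return False
        state = transitions[(state, terminal)]
    return state in accepting

examples = [["a", "b", "c"], ["d", "b", "e"]]
transitions, accepting = infer_szilard_star(examples)
print(star_accepts(transitions, accepting, ["a", "b", "e"]))  # True
```

Because "b" always leads to the same state, the unseen sentence "a b e" is accepted: this is the generalization that the Szilard* assumption buys.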


Note, however, that the Szilard regular languages are included in the Szilard* regular languages, because the property that a terminal uniquely identifies a transition between a pair of states implies that it also identifies the next state.

We do not know if a natural language, or at least a part of it, can be described by a Szilard grammar, but it can be regarded as such for the purpose of learning a grammar from positive examples. The trivial DFA for a set of examples has the Szilard property if each terminal occurrence is considered distinct. For example, in the sentences {“the_1 boys see the_2 cat”, “the_3 girls walk”} the three occurrences of “the” are considered different. Unfortunately this does not solve the problem yet: we obtain the desired Szilard property, but the trivial DFA is not the language representation we want to learn. The two sentences above suggest that not all occurrences of “the” should be deemed different. While “the_1” and “the_2” are different, one determining the noun in the subject and the other in the direct object, “the_1” and “the_3” both determine the subject noun and so should belong to the same terminal. It follows that we need an algorithm that partitions word occurrences into classes that can be associated with grammar terminals.


Figure 2: Terminal merging. Terminal merging in a Szilard DFA preserves the property that one terminal appears on one arc only. Terminal merging in a Szilard* DFA preserves the property that the same terminal always leads to the same state.


An interesting property of the Szilard regular grammars is that if we conflate two terminals and then merge their associated states, the grammar remains Szilard. As can be seen in figure 2, if we merge the terminals “the_1” and “the_3” into one terminal “the_1,3” and the states $\{s_0, s_6\} \to s_{0,6}$, $\{s_3, s_7\} \to s_{3,7}$ for the Szilard DFA, the resulting automaton retains its defining property, namely that each terminal appears on just one arc. The Szilard* property is also preserved if the two terminals “the_1” and “the_3” and the states $\{s_0, s_6\}$ are merged. This leads to the following strategy for addressing the problem of grammar induction, by breaking it into two subproblems:

•  “Szilard-ify”  the  language  by  immediately  constructing  the 
trivial  DFA. 

•  “compact”  the  trivial  DFA,  by  merging  word  instances,  and 
the  associated  states. 
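The first subproblem, treating every word occurrence as its own terminal, and the relabeling that follows a merge can be sketched as below. The function names and the per-word occurrence counter are illustrative assumptions; how the classes are chosen is the subject of the next section.

```python
from collections import defaultdict

def tag_occurrences(sentences):
    """Turn each word occurrence into a distinct terminal (word, index)."""
    seen = defaultdict(int)
    tagged = []
    for sentence in sentences:
        row = []
        for word in sentence:
            seen[word] += 1
            row.append((word, seen[word]))
        tagged.append(row)
    return tagged

def relabel(tagged, classes):
    """Replace merged occurrences by their class name; others keep their tag."""
    label = {occ: name for name, members in classes.items() for occ in members}
    return [[label.get(occ, occ) for occ in row] for row in tagged]

sentences = [["the", "boys", "see", "the", "cat"], ["the", "girls", "walk"]]
tagged = tag_occurrences(sentences)
merged = relabel(tagged, {"the_1,3": [("the", 1), ("the", 3)]})
print(merged[1][0])  # the_1,3
```

After relabeling, the sentences can be passed to either inference procedure of this section, since every remaining symbol is again treated as a single grammar terminal.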


3.  Approach  and  Algorithms 

It follows that a device is needed that can distinguish word occurrences up to classes associated with the grammar’s terminals. From these classes, a Szilard regular grammar can then be inferred, in the form of a deterministic finite automaton (DFA). Elman (1990) reported that an Elman-type recurrent neural network (rnn) can classify the word instances in a partition that reflects grammatical categories. We will use this device to extract representations of word instances which are suitable for the task of identifying terminal classes. These representations will then serve as input to a DFA extraction algorithm. The generic Elman-type rnn is presented in figure 3: the hidden layer encodes the current network state and the context (recurrent) units maintain a copy of the previous state. The output layer encodes the probability distribution of the next word for the current input symbol. The network as a dynamical system is described by the state function F and its output is given by the function O:

$\mathit{next\_state} = F(\mathit{current\_input\_symbol}, \mathit{current\_state})$

$P(\mathit{next\_word} \mid \mathit{current\_word}) = O(\mathit{current\_state})$
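One time step of such a network can be sketched in plain Python. The logistic hidden units, the softmax output, the absence of bias terms and the small hand-picked weights are all assumptions made for illustration; the text only fixes the roles of F and O.

```python
import math

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def elman_step(x, context, w_in, w_ctx, w_out):
    """Return (new hidden state, next-word distribution) for one input."""
    pre = [a + b for a, b in zip(matvec(w_in, x), matvec(w_ctx, context))]
    hidden = [1.0 / (1.0 + math.exp(-p)) for p in pre]      # state function F
    scores = matvec(w_out, hidden)
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    output = [e / total for e in exps]                      # output function O
    return hidden, output

# Hypothetical sizes: 3 vocabulary words, 2 hidden (state) units.
w_in = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
w_ctx = [[0.1, 0.4], [-0.3, 0.2]]
w_out = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]]

context = [0.0, 0.0]                 # network state at the start of a sentence
for word in ([1, 0, 0], [0, 1, 0]):  # one-hot encoded input words
    hidden, output = elman_step(word, context, w_in, w_ctx, w_out)
    context = hidden                 # context units copy the hidden layer
print(round(sum(output), 6))  # 1.0
```

The context units hold a copy of the previous hidden state, so the new state depends on both the current input and the path that led to it.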


Figure 3: Elman recurrent network. (Diagram labels: one input unit per vocabulary word; one state unit per grammar variable; output(t).)

The rnn state function F is similar to the transition function of a finite automaton, and so it appears convenient to draw a one-to-one correspondence between rnn and DFA states, thus solving the problem of grammar induction. Unfortunately this is not immediately possible, the main reason being that the network states belong to a continuous space, while the DFA states are discrete. One method used so far for DFA inference from rnn (Giles et al., 1992) is to assume that the network states which are close in the continuous space form clusters that represent the automaton states. This procedure is equivalent to extracting the Szilard* DFA, where the terminal classes uniquely identify the states. Both positive and negative examples were used in that work. Kolen (1994) argued that the network states cannot be mapped directly onto the DFA states because of an instability of the dynamical system represented by the recurrent network, which makes the DFA state encoding in the hidden layer shift in time. We can address this problem by resetting the network before the beginning of each sentence. Another problem is that of local minima: due to its huge parameter space, the network can use similar state vectors for either the same or different DFA states and still encode the desired output.

Since our goal is to extract grammatical categories from the network, we will look at ways of distinguishing word instances, based on the rnn states, rather than trying directly to extract DFA states. The DFA can be extracted afterwards, due to the assumed Szilard property. If we consider each word occurrence to be represented by the hidden layer as in (Elman, 1990), we might lose the information encoded in the previous state. The example in figure 4 illustrates how two different words can get similar representations, because of the constraints imposed on the hidden layer by the output function: $(h = F(\mathrm{see}, A_i)) \approx (h' = F(\mathrm{sees}, A_j))$ because $O(h)$ must equal $O(h')$, so “see” and “sees” can be clustered together.

Figure 4: How “see” and “sees” can get the same representation. The weights can encode a function F such that $F(\mathrm{see}, A_i) \approx F(\mathrm{sees}, A_j)$. (Diagram note: the representations are similar if based only on the next word probability distribution.)

This problem can be viewed as a misrepresentation of the DFA states in the hidden layer. It turns out that states, and thus word occurrences, can be better distinguished by the different paths that lead to them. A simple way of achieving this is by considering the concatenation of the context layer (previous network state) with the hidden layer (current state) as the representation of word instances. Even so, the problems of non-discrete and possibly falsely similar vectors of word occurrences remain. It follows that two other processing steps are needed:

•  network state clustering, using the Euclidean distance, for detecting the potentially similar word occurrence representations

•  merging the word instances considered similar by the previous step, but only if they also conform to a criterion other than their vector distance

A distributional criterion, albeit weak, that allows discrimination of word instances is the probability distribution of the previous and next word occurrences. While the representations of words reflect local information (consecutive states in a dynamical system), the probability distributions carry global information that spans sentences.

The  overall  processing  is: 


1.  obtain vector representations of word occurrences, using an Elman-type rnn

2.  hierarchically cluster the word instances, using the Euclidean distance, obtaining a binary tree

3.  create initial classes of words from the leaves of the tree belonging to the same subtree, at a certain low level in the tree

4.  “climb” the tree only if the two children of the current internal node represent two classes that can be merged according to the distributional criterion

5.  extract the DFA from the classes of terminals obtained at the previous step, by considering that the target grammar is Szilard
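Step 2 of the list above can be illustrated with a small agglomerative clustering routine that returns a binary tree. The centroid linkage used here is an assumption (the text only says that the Euclidean distance is used), and the function names are hypothetical.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def centroid(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def cluster(vectors):
    """Return a binary tree: a leaf is an index, an internal node a pair."""
    clusters = [(i, [i]) for i in range(len(vectors))]   # (tree, member indices)
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci = centroid([vectors[k] for k in clusters[i][1]])
                cj = centroid([vectors[k] for k in clusters[j][1]])
                d = euclidean(ci, cj)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)                          # closest pair becomes siblings
    return clusters[0][0]

vectors = [[0, 0], [0, 1], [10, 10], [10, 12]]
print(cluster(vectors))  # ((0, 1), (2, 3))
```

Word instances with nearly identical vectors end up as siblings near the fringe of the tree, which is where the initial classes of step 3 are read off.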

4.  Implementation 

4.1.  The  language 

We wrote a small context free grammar (CFG), similar to the one used in (Elman, 1992). From this CFG we obtained a regular grammar by expanding the start symbol with all possible productions, up to an arbitrary depth in the derivation trees. This regular grammar can generate only a subset of the original language. Furthermore, the regular grammar is used to generate sentences no longer than a chosen length. The target of our learning system is this finite regular language, which is exhaustively presented to the learner. The fact that the language is finite has no influence on the learning algorithm, which neither builds the trivial DFA, nor assumes that it has seen the entire language or that the language is finite.

The initial grammar we used is given in figure 5. The grammar encodes no context constraints, so it can generate sentences like “John hears John”, which are unlikely from the semantic point of view. We are not concerned here with the semantic content of words, but want to test that the two occurrences of John, which are syntactically different, are considered so by the learning system.

4.2.  The  recurrent  network 

For the experiments we used the package “tlearn” (Plunkett & Elman, 1997). We used one input unit and one output unit for each word in the language, as in (Elman, 1990). An extra output unit encoded the (End_of_Sentence) marker. The number of hidden units was equal to the number of states in the target DFA. The network was trained on the classification problem: predict the next word or (End_of_Sentence). We used the cross-entropy function as the error function. The cross-entropy function, when applied to classification tasks, was shown by Rumelhart et al. (1993) to make a network learn the probability distribution over the output vectors. The network state was reset before the start of each sentence, in order to avoid the instability phenomenon mentioned in (Kolen, 1994). In our experiments this training regime gave the best results in terms of word clustering.
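The reason the cross-entropy error drives the outputs towards a probability distribution can be checked numerically: when the next word is drawn from a distribution p, the expected error of an output q is smallest at q = p. The sketch below uses made-up numbers and a hypothetical function name, and is independent of the tlearn package.

```python
import math

def expected_cross_entropy(p, q):
    """Expected error -sum_i p_i log q_i for targets drawn from p, outputs q."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]                      # hypothetical true next-word distribution
candidates = [
    [0.5, 0.3, 0.2],
    [0.6, 0.2, 0.2],
    [1 / 3, 1 / 3, 1 / 3],
]
best = min(candidates, key=lambda q: expected_cross_entropy(p, q))
print(best)  # [0.5, 0.3, 0.2]
```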

For small languages we inspected the output units and, as expected, they encoded a close approximation of the probability distribution over the next words. For larger languages, the approximation became less accurate.

Figure 6: Fragment of initial tree

4.3.  The  merging  algorithm 

The algorithm relies on the initial clustering of word instances, based on their vector representation. This vector was obtained by concatenating the context and hidden layers in the network. We used a simple hierarchical clustering algorithm that yields a binary tree. The word instances that have almost identical vectors are placed by the algorithm in subtrees at low levels in the tree, as illustrated in figure 6. These subtrees form the initial classes of word instances.

After the initialization stage the tree can be viewed as in figure 7. The merging algorithm then proceeds by trying to merge classes of words associated with sibling nodes in the tree. For the example in figure 7, the two classes “John” and “Mary” are proposed for merging. The criterion for merging is the similarity of their probability distributions over the next and previous word classes. That is, “John” and “Mary” are merged into one class if they tend to be preceded and followed by the same word classes. We use the G statistic (Cohen, 1995) to test if there is a statistically significant difference between the two probability distributions.

Figure 7: Fragment of the tree after the initialization step. There are 4 instances of “Mary” with identical vectors. All were placed in a subtree at the third level from the fringe and now form one class. The same holds for “John”.

The G statistic is asymptotically $\chi^2$ distributed. In its standard form, $G = 2 \sum_i O_i \ln(O_i / E_i)$, where $O_i$ and $E_i$ are the observed and expected counts in cell $i$ of the contingency table.

The merging process continues until no more merges are possible.
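A merge decision based on the G statistic can be sketched as follows, using the standard form G = 2 Σ O ln(O / E) on a 2 × k table of neighbouring-class counts. The table layout, the function names and the choice to pass the critical value in by hand are assumptions; the exact test configuration used in the experiments is not stated.

```python
import math

def g_statistic(counts_a, counts_b):
    """G statistic and degrees of freedom for two count dictionaries."""
    keys = sorted(set(counts_a) | set(counts_b))
    rows = [[c.get(k, 0) for k in keys] for c in (counts_a, counts_b)]
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    total = sum(row_totals)              # assumed to be non-zero
    g = 0.0
    for row, row_total in zip(rows, row_totals):
        for observed, col_total in zip(row, col_totals):
            if observed > 0:             # empty cells contribute nothing
                expected = row_total * col_total / total
                g += observed * math.log(observed / expected)
    return 2.0 * g, len(keys) - 1

def can_merge(counts_a, counts_b, critical_value):
    """Merge only if the two distributions are not significantly different."""
    g, _ = g_statistic(counts_a, counts_b)
    return g <= critical_value

john = {"sees": 10, "walks": 10}         # hypothetical next-word-class counts
mary = {"sees": 5, "walks": 5}
print(can_merge(john, mary, 3.841))      # True
```

The value 3.841 is the 0.05 critical value of the $\chi^2$ distribution with one degree of freedom; with more neighbouring classes the threshold has to be taken for the corresponding degrees of freedom.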


4.4.  The  DFA  extraction  algorithm 

We do not know exactly what information is encoded in the word representations obtained by the rnn. We can only assume that these word instances uniquely identify either transitions or states in the target DFA. Both DFA extraction algorithms, as presented in section 2, can be applied after the word classes that define the grammar terminals are formed during the previous stage. It can be immediately observed that if there is a one-to-one correspondence between the obtained rnn states and the states of the original DFA, then both extraction algorithms will recover the target automaton.

5.  Experiments and results

The context free grammar in figure 5 was expanded at depths 1, 2 and 3 in order to obtain regular grammars that approximate the original grammar. These regular grammars, named “elm_r1” and “elm_r2”, were then used to generate sentences of up to 3, 4, 5 and 6 words. The resulting languages are “elm_r1_d3”, “elm_r1_d4” for sentences of 3 and 4 words, from the regular grammar “elm_r1”, and “elm_r2_d5” and “elm_r3_d6”, respectively.

Both Szilard and Szilard* regular grammars were induced for all languages. The results are shown in table 1.

The original and the induced automata for the language “elm_r1_d4” can be seen in figure 8.

In all the induced automata, distinct classes formed from the same words can be observed. For example, “sees hears walks lives” appear on different transitions in figure 8. From figure 8.b it follows that there are three such classes, associated with the states “q2”, “q9” and “q11”. These classes are formed from non-overlapping sets of occurrences of the four words which were not merged, and so are considered distinct classes. They were not merged because their vector representations, as extracted from the network, are not similar.

language (size)          automaton      states   language   corr. sent.   err. sent.
elm_r1_d3 (40 sent.)     Original DFA   6
                         Szilard* DFA   16       finite     40            0
                         Szilard DFA    8        finite     40            0
elm_r1_d4 (56 sent.)     Original DFA   8
                         Szilard* DFA   20       finite     56            0
                         Szilard DFA    12       finite     56            0
elm_r2_d5 (512 sent.)    Original DFA   22
                         Szilard* DFA   47       finite     512           43
                         Szilard DFA    31       finite     512           258
elm_r3_d6 (1184 sent.)   Original DFA   31
                         Szilard* DFA   80       infinite   1184 *        1133 *
                         Szilard DFA    63       infinite   1184 *        4340 *

Table 1: The original and induced automata for the four increasingly complex sub-languages of the grammar in figure 5. * The sentences were obtained by imposing a limit of 7 words on the sentence length.

Figure 8: a. Original DFA for the language elm_r1_d4. b. Induced Szilard* DFA. c. Induced Szilard DFA.

For the slightly larger languages, “elm_r2_d5” and “elm_r3_d6”, the induced grammars no longer recognize exactly the target languages, but supersets of them. A sample of correct and incorrect sentences can be seen in table 2. From the sentences listed there it can be noticed that some distinctions which were encoded in the original grammar are not learned. Such is the distinction between verbs that require a direct object and verbs for which it is optional. This type of error yields incorrect sentences, like “the boys feed”. Another missed distinction is between verbs that require a human subject and those for which it is optional: the same “feed”, but in “the cats feed Mary”. Some other sentences are grammatically correct generalizations that are longer than the number of words allowed by their original grammars. Such is “the boys who see hear John” in table 2, which is generated by the “elm_r2_d5” Szilard DFA. Although this sentence is not present in the original language, it actually appears in the larger language, “elm_r3_d6”.
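Counts of correct and erroneous sentences such as those in table 1 can be obtained by enumerating every sentence an induced automaton accepts up to a length limit and comparing the result with the target language. The sketch below assumes a dictionary-based DFA and uses a small hypothetical automaton and target set; it is not the evaluation code used for the experiments.

```python
def enumerate_language(transitions, start, finals, max_len):
    """All non-empty sentences of at most max_len symbols accepted by the DFA."""
    accepted = set()
    frontier = [(start, ())]
    for _ in range(max_len):
        next_frontier = []
        for state, prefix in frontier:
            for (source, symbol), target in transitions.items():
                if source == state:
                    sentence = prefix + (symbol,)
                    if target in finals:
                        accepted.add(sentence)
                    next_frontier.append((target, sentence))
        frontier = next_frontier
    return accepted

def score(induced_sentences, target_sentences):
    """Return (number of correct sentences, number of erroneous sentences)."""
    correct = induced_sentences & target_sentences
    errors = induced_sentences - target_sentences
    return len(correct), len(errors)

# Hypothetical induced DFA for a (c e | d e)* b and a hypothetical target set.
transitions = {
    ("S0", "a"): "Q1", ("Q1", "c"): "Q2", ("Q1", "d"): "Q2",
    ("Q2", "e"): "Q1", ("Q1", "b"): "FS",
}
target = {("a", "b"), ("a", "c", "e", "b")}
induced = enumerate_language(transitions, "S0", {"FS"}, 4)
print(score(induced, target))  # (2, 1)
```

A length limit is needed whenever the induced automaton contains a cycle, since its language is then infinite.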


elm_r2_d5

  correct sentences:
    “the boys feed the cats.”
    “John and Mary see.”
    “the dogs chase Mary.”
    “the dogs who see live.”

  incorrect sentences, Szilard*:
    “the boys feed.”
    “John and Mary walk Mary.”
    “the dogs chase.”
    “the cats feed Mary.”

  incorrect sentences, Szilard:
    “the boys who see hear John.”
    “the girl walks the dog.”
    “the dog lives the boy.”

elm_r3_d6

  correct sentences:
    “the girl feeds John and John.”
    “the girl who walks feeds John.”
    “the cats who walk chase John.”
    “the boys who see hear John.”
    “the boy who John feeds lives.”
    “Mary and John see the girls.”

  incorrect sentences, Szilard*:
    “Mary and John feed.”
    “the girl chases the girl feeds John.”
    “the girl who walks walks Mary.”
    “the girls who live feed the dogs.”

  incorrect sentences, Szilard:
    “John and John chase John and Mary.”
    “John and Mary live the girl.”
    “the boy chases the boy hears Mary.”

Table 2: Samples of correct and incorrect sentences for languages elm_r2_d5 and elm_r3_d6.


6.  Conclusions 

We showed that good approximations of regular grammars can be learned from positive examples by considering each word occurrence unique and then merging these occurrences into classes of grammar terminals. While the results for the languages presented were quite good, we expect the learning system to perform less well for more complex grammars. There are at least two reasons for this. The first is the behaviour of the recurrent network: due to its huge parameter space, it can almost always find a function that predicts the probability distribution over the next words without its hidden units encoding the DFA states. The second reason is that the distributional criterion is weak: the previous and next word probability distributions do not encode enough information to distinguish words. For example, because of sentences like “Mary sees” and “John and Mary see”, the instances of “see” and “sees” can be merged, if such a merge is proposed by the rnn.

Human language learners have access to a vital source of information that is unavailable to our algorithms, the context in which a sentence is uttered. We hypothesize that by adding information about the states, in terms of the semantic content of the current word, the network search space can be reduced such that the network is more likely to find the desired function. Furthermore, if the word instances are distinguished by additional information, we have better grounds to treat the original language as Szilard. It is appealing to consider that this is actually the case with natural languages, where it is the context that makes the distinction between word occurrences.